Why do we need patterns?
We need patterns to describe text that is inexact, such as 'all words starting with s', 'all words starting with t and ending with m', or 'all 9 digit numbers'. We might be able to list all the possible variations, but it would be impractical and annoying. We use a pattern matching language to describe a pattern. A pattern matching language uses symbols to describe both normal characters (that are matched exactly) and meta characters (that describe special operations, such as alternatives and repeating sections). Patterns are almost always described using what is called a regular expression, so called because the pattern matching language fulfills some mathematical properties that we don't need to worry about.
Patterns can become very complicated very quickly, so let's start with some simple examples. We've all done a search and replace where we don't care about the case of the search string (Case Sensitive is Off). For example, if I search for 'Car', I also want to find 'car', 'CAR' and any other case variations. We are implicitly searching for an upper or lower case 'C', followed by an upper or lower case 'A', followed by an upper or lower case 'R'. This is an example of a simple pattern. In a pattern matching language, we could express this as follows:
    [Cc][Aa][Rr]
Each set of square brackets is a character class, meaning 'find any one of these characters' - a 'C' or 'c', then an 'A' or 'a', then an 'R' or 'r'. If we changed this pattern to
    [BbCc][Aa][Rr]
then we could also find 'Bar', 'bar' and 'BAR' (and others). Of course, if we just toggled the Case Sensitivity flag then it wouldn't matter if we used, for example,
    [bc]ar
Note that the letters outside of the square brackets simply represent themselves - they are not special. Here are some more character classes:
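As a rough illustration of the character classes described so far, here is a minimal sketch using Python's `re` module. Python is an assumption made only to have something runnable; TextPipe uses its own pattern engine and option dialogs, and the sample text is invented.

```python
import re

sample = "Car car CAR cAr bar"

# One class per letter: upper or lower case C, then A, then R.
car_matches = re.findall(r"[Cc][Aa][Rr]", sample)

# Widening the first class also admits 'B' and 'b'.
bar_or_car_matches = re.findall(r"[BbCc][Aa][Rr]", sample)

# With case-insensitive matching on, plain letters outside the brackets
# match either case, so the classes for 'a' and 'r' are no longer needed.
insensitive_matches = re.findall(r"[bc]ar", sample, re.IGNORECASE)

print(car_matches)          # ['Car', 'car', 'CAR', 'cAr']
print(bar_or_car_matches)   # ['Car', 'car', 'CAR', 'cAr', 'bar']
print(insensitive_matches)  # ['Car', 'car', 'CAR', 'cAr', 'bar']
```

Here `re.IGNORECASE` plays the role of turning Case Sensitive off.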
TextPipe also provides some convenient shortcuts for commonly used classes. They may be used either on their own or inside a character class.
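The sketch below shows how such shortcuts behave, again using Python's `re` module as a stand-in. It assumes that `\d` (a digit) and `\s` (whitespace) are among the shortcuts, since they are standard in Perl-style pattern syntax; the sample text is made up.

```python
import re

text = "Order 66 shipped 3.5 kg on 2024-01-15"

# A shortcut used on its own: runs of digits.
digit_runs = re.findall(r"\d+", text)

# The same shortcut inside a character class: digits or a full stop.
numbers = re.findall(r"[\d.]+", text)

# \s matches any whitespace character (space, tab, newline, ...).
fields = re.split(r"\s+", "alpha  beta\tgamma")

print(digit_runs)  # ['66', '3', '5', '2024', '01', '15']
print(numbers)     # ['66', '3.5', '2024', '01', '15']
print(fields)      # ['alpha', 'beta', 'gamma']
```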
You can also describe 'special' or unprintable characters using the following patterns:
Repetition - quantifiers
Often you are looking for a pattern that is repeated. TextPipe provides a number of different methods to specify how many times a pattern is allowed to repeat. Each method is called a quantifier. The general repetition quantifier specifies a minimum and maximum number of permitted matches, by giving the two numbers in curly brackets (braces), separated by a comma. The numbers must be less than 65536, and the first must be less than or equal to the second. For example:
    z{2,4}
matches "zz", "zzz", or "zzzz". A closing brace on its own is not a special character. If the second number is omitted, but the comma is present, there is no upper limit; if the second number and the comma are both omitted, the quantifier specifies an exact number of required matches. Thus
    [aeiou]{3,}
matches at least 3 successive vowels, but may match many more, while
    \d{8}
matches exactly 8 digits. An opening curly bracket that appears in a position where a quantifier is not allowed, or one that does not match the syntax of a quantifier, is taken as a literal character. For example, {,6} is not a quantifier, but a literal string of four characters. For convenience (and historical compatibility) the three most common quantifiers have single-character abbreviations:
    *    is equivalent to {0,}
    +    is equivalent to {1,}
    ?    is equivalent to {0,1}
Normally TextPipe will try to repeat matches as many times as possible - this is known as greedy (or maximal) matching. Using the Search and Replace Pattern Options dialog you can change this to non-greedy (or minimal) matching. If a quantifier is followed by a question mark, it inverts the default 'greediness' of the quantifier.
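To make the difference between greedy and non-greedy matching concrete, here is a small sketch in Python's `re` module (an assumption for illustration, not TextPipe itself). Python has no dialog option that flips the default, so only the question-mark suffix is shown; the HTML fragment is invented.

```python
import re

html = "<b>one</b> and <b>two</b>"

# Greedy: .* takes as much as it can, so the match runs from the first
# <b> to the last </b>.
greedy = re.findall(r"<b>.*</b>", html)

# Non-greedy: .*? takes as little as it can, so each tag pair is matched
# separately.
lazy = re.findall(r"<b>.*?</b>", html)

print(greedy)  # ['<b>one</b> and <b>two</b>']
print(lazy)    # ['<b>one</b>', '<b>two</b>']
```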
Using these quantifiers we can now build the following examples:
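A few quantifier patterns can be tried out with the sketch below. It uses Python's `re` module as a stand-in, and the sample strings are invented rather than taken from the original examples. One difference to be aware of: Python accepts {,6} as a quantifier meaning "0 to 6", whereas the text above describes it as a literal string.

```python
import re

# {2,4}: between two and four 'z' characters per match.
z_runs = re.findall(r"z{2,4}", "z zz zzzzz")

# {3,}: three or more successive vowels.
vowel_runs = re.findall(r"[aeiou]{3,}", "queue beautiful")

# {8}: exactly eight digits; \b (a word boundary) stops part of a longer
# number from matching.
eight_digits = re.findall(r"\b\d{8}\b", "12345678 123 1234567890")

# ? is short for {0,1}: the 'u' is optional.
spellings = re.findall(r"colou?r", "color colour")

# + is short for {1,} and * is short for {0,}.
plus_runs = re.findall(r"ab+", "a ab abbb")
star_runs = re.findall(r"ab*", "a ab abbb")

print(z_runs)        # ['zz', 'zzzz']
print(vowel_runs)    # ['ueue', 'eau']
print(eight_digits)  # ['12345678']
print(spellings)     # ['color', 'colour']
print(plus_runs)     # ['ab', 'abbb']
print(star_runs)     # ['a', 'ab', 'abbb']
```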
Alternatives
Vertical bar ('pipe') characters are used to separate alternative patterns. For example, the pattern
    gilbert|sullivan
matches either "gilbert" or "sullivan". Any number of alternatives may appear, and an empty alternative is permitted (matching the empty string). The matching process tries each alternative in turn, from left to right, and the first one that succeeds is used. If the alternatives are within a sub pattern (defined below), "succeeds" means matching the rest of the main pattern as well as the alternative in the sub pattern.

Positional matching or anchors
Sometimes you only want a pattern to match when it is found at the start of a file, or the end of a line, or when it's a whole word. Various positional operators force the match to occur only when it's found in a specific position.
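The following sketch illustrates alternatives and positional matching with Python's `re` module. The anchors used here (`^`, `$` and `\b`) are assumed from standard Perl-style syntax and may not be the full set TextPipe offers; the sample text is invented.

```python
import re

# Alternatives are tried left to right at each position.
names = re.findall(r"gilbert|sullivan", "gilbert and sullivan")

# The first alternative that succeeds wins, even if a later one is longer.
first_wins = re.match(r"cat|caterpillar", "caterpillar").group()

text = "start here\nnot at start\nstart again"

# ^ with MULTILINE anchors the match to the beginning of each line.
line_starts = re.findall(r"^start", text, re.MULTILINE)

# $ with MULTILINE anchors the match to the end of each line.
line_ends = re.findall(r"\w+$", text, re.MULTILINE)

# \b marks a word boundary, so only the whole word 'cat' matches.
whole_words = re.findall(r"\bcat\b", "cat concat cats cat.")

print(names)        # ['gilbert', 'sullivan']
print(first_wins)   # cat
print(line_starts)  # ['start', 'start']
print(line_ends)    # ['here', 'start', 'again']
print(whole_words)  # ['cat', 'cat']
```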
Sub Patterns and the Replacement String
Sub patterns are delimited by parentheses (round brackets), which can be nested. Marking part of a pattern as a sub pattern does two things:
1. It localizes a set of alternatives. For example, the pattern
    cat(aract|erpillar|)
matches one of the words "cat", "cataract", or "caterpillar". Without the parentheses, it would match "cataract", "erpillar" or the empty string.
2. It sets up the sub pattern as a capturing sub pattern. The text matched by a capturing sub pattern can be referred to later (such as in the replacement string) using the macros $0 (for the full matching text), $1 (for the first sub pattern), $2 ... $9, $a ... $z etc. Opening parentheses are counted from left to right (starting from 1) to obtain the numbers of the capturing sub patterns. For example, if the string "the red king" is matched against the pattern
    the ((red|white) (king|queen))
the captured substrings are "red king", "red", and "king", and are numbered 1, 2, and 3. The fact that plain parentheses fulfill two functions is not always helpful. There are often times when a grouping sub pattern is required without a capturing requirement. If an opening parenthesis is followed by "?:", the sub pattern does not do any capturing, and is not counted when computing the number of any subsequent capturing sub patterns. For example, if the string "the white queen" is matched against the pattern
    the ((?:red|white) (king|queen))
the captured substrings are "white queen" and "queen", and are numbered 1 and 2. The maximum number of captured substrings is 36 (0-9, a-z), and the maximum number of all sub patterns, both capturing and non-capturing, is 200.
Using parentheses appropriately in the search expression lets the program remember found text, to be used as replacement text. The simplest example of this, which needs no parentheses, is the complete found string, represented by ‘$0’ in the replacement string. For example, if your search regular expression was ‘test|trial|experiment’, and your replacement string was ‘<b>$0</b>’, every instance of the word ‘test’ in your document would be replaced by ‘<b>test</b>’, and similarly for ‘trial’ and ‘experiment’ (in an HTML document, this would have the effect of bolding these words). Note that if you want to include an actual ‘$’ in your replacement text, escape it as ‘$$’.
You can use parentheses to remember specific parts of the found text. For example, the regular expression ‘<img src="([^"]+)">’ would find and match image tags in an HTML document (assuming they were formed exactly like this), and would remember the source image file. You can recall this by using ‘$1’ in your replacement string. So if your replacement string was ‘-- Image File: $1 --’, and the HTML file processed contained the string ‘<img src="/images/test.gif">’, that string would be replaced by ‘-- Image File: /images/test.gif --’.
You can remember multiple parts of the found text. For example, the regular expression ‘<img src="([^"]+)" alt="([^"]+)">’ would find and match a string such as ‘<img src="/images/test.gif" alt="My Cool Image">’. If the replacement string was ‘$2 ($1)’, then this image would be replaced by ‘My Cool Image (/images/test.gif)’.

Performance
The time a search for a regular expression takes can range from unimportant to unbearably slow.
Seemingly small changes in a Regular Expression can make a world of difference. For example, the regular expression
finds and matches quoted strings (which it remembers), as well as surrounding text without quotes. This particular regular expression executes quickly enough (in seconds or less) if the file being processed actually has quoted strings in it. If, however, the file is of a reasonable size (say, 50 KB) and does not have any quoted strings in it, the search will take an incredible amount of time to complete, maybe minutes! It turns out that the first piece of the expression is causing the problem. Removing it makes the search on the 50 KB file nearly instantaneous. Why? Because now the search is smart enough to realize that before it bothers matching anything else, it must at least find a double quote. When it doesn’t find one, it has finished its search.
So one trick to making fast regular expressions is to form the regular expression in such a way that it can fail as early as possible. Try to put strings that must be matched right at the beginning of the regular expression. Apart from this, experiment and see what works well and what doesn’t.

Things To Note - Escaping Meta characters
If you want to find any of the meta characters on their own, you must escape them with a backslash to prevent them being interpreted. When in doubt, quote all non-alphanumeric characters. The meta characters are:
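The effect of escaping can be seen in the short sketch below, which uses Python's `re` module as a stand-in for TextPipe. It assumes that `$`, `.`, `(` and `)` are among the meta characters, as they are in Perl-style syntax. The `re.escape()` helper is a Python convenience, not a TextPipe feature, and the sample text is invented.

```python
import re

text = "Total: $4.50 (approx.) or 4x50"

# Unescaped, '.' matches any character, so '4x50' is found as well.
loose = re.findall(r"4.50", text)

# Escaped with backslashes, '$' and '.' stand for themselves.
exact = re.findall(r"\$4\.50", text)

# re.escape() adds the backslashes for text meant to be matched literally.
escaped_pattern = re.escape("(approx.)")
literal_hit = re.search(escaped_pattern, text).group()

print(loose)            # ['4.50', '4x50']
print(exact)            # ['$4.50']
print(escaped_pattern)  # \(approx\.\)
print(literal_hit)      # (approx.)
```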
Further reading
For more information on performance, assertions, back references, limitations, POSIX character classes and more details on the above text, please see the Perl/PCRE pattern matching reference.
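Returning to the sub pattern and replacement string examples above, the sketch below reproduces them with Python's `re` module so they can be run. This is an assumption-laden translation rather than TextPipe's own behaviour. In particular, Python writes captured text in the replacement as `\1`, `\2` or `\g<0>`, where TextPipe uses `$1`, `$2` and `$0`.

```python
import re

# Capturing sub patterns are numbered by their opening parenthesis.
capturing = re.search(r"the ((red|white) (king|queen))", "the red king")
captured = capturing.groups()

# (?:...) groups without capturing, so it is skipped in the numbering.
non_capturing = re.search(r"the ((?:red|white) (king|queen))", "the white queen")
captured_nc = non_capturing.groups()

# \g<0> is the whole match (TextPipe: $0).
bolded = re.sub(r"test|trial|experiment", r"<b>\g<0></b>", "a test and a trial")

# \1 and \2 recall the first and second captures (TextPipe: $1 and $2).
tag = '<img src="/images/test.gif" alt="My Cool Image">'
described = re.sub(r'<img src="([^"]+)" alt="([^"]+)">', r"\2 (\1)", tag)

print(captured)     # ('red king', 'red', 'king')
print(captured_nc)  # ('white queen', 'queen')
print(bolded)       # a <b>test</b> and a <b>trial</b>
print(described)    # My Cool Image (/images/test.gif)
```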